Skip to content

Avoid repetitive creation of fp4/fp8 native-custom-op domains for NvTensorRtRtx EP#27192

Open
vishalpandya1990 wants to merge 4 commits intomicrosoft:mainfrom
vishalpandya1990:vipandya/debug_1
Open

Avoid repetitive creation of fp4/fp8 native-custom-op domains for NvTensorRtRtx EP#27192
vishalpandya1990 wants to merge 4 commits intomicrosoft:mainfrom
vishalpandya1990:vipandya/debug_1

Conversation

@vishalpandya1990
Copy link
Contributor

@vishalpandya1990 vishalpandya1990 commented Jan 28, 2026

Description

  • Avoid repetitive creation of FP4/FP8 native custom-ops in create method for custom-op domains (leaving plugin-based custom-op handling as is, as it was before native-custom-ops addition in PR-26555).
  • Avoid deleting the custom-op domains at destructor time, since those are created with static scope, so avoid potential double-delete.

Motivation and Context

  • Repetitive checks and creation of custom-ops domain is redundant. So, cleaning it up a bit.
  • Explicit deletion of static objects in destructor can lead to double-delete. So, avoiding it.

@vishalpandya1990
Copy link
Contributor Author

CC @yuslepukhin @tianleiwu

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR refactors the NvTensorRTRTX execution provider's custom ops initialization to improve efficiency and correctness. It introduces a flag-based approach to avoid redundant initialization of FP4/FP8 native custom ops and removes potentially dangerous manual deletion of statically-allocated objects.

Changes:

  • Added native_custom_ops_initialized flag to track whether native custom ops (TRT_FP4DynamicQuantize, TRT_FP8QuantizeLinear, TRT_FP8DequantizeLinear) have been created
  • Restructured control flow to add already-initialized native ops to the domain list without re-creating them
  • Made release functions no-ops since the custom op domains are static objects managed by unique_ptr with static storage duration

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@vishalpandya1990
Copy link
Contributor Author

Error from the logs in failing job -

2026-01-30T10:44:34.0610538Z FAILED transformers/test_gqa.py::TestGQARegressions::test_gqa_rope_separate_qkv_bug - AssertionError: Torch not compiled with CUDA enabled

This doesn't look related to the change. Not sure if there is any regression in ToT.

@vishalpandya1990
Copy link
Contributor Author

Error from the logs in failing job -

2026-01-30T10:44:34.0610538Z FAILED transformers/test_gqa.py::TestGQARegressions::test_gqa_rope_separate_qkv_bug - AssertionError: Torch not compiled with CUDA enabled

This doesn't look related to the change. Not sure if there is any regression in ToT.

I have synced the branch.

@vishalpandya1990
Copy link
Contributor Author

@yuslepukhin @tianleiwu Can I get a review for this?

@chilo-ms
Copy link
Contributor

chilo-ms commented Feb 9, 2026

/azp run Linux QNN CI Pipeline, Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline,Windows x64 QNN CI Pipeline

@azure-pipelines
Copy link

Azure Pipelines successfully started running 4 pipeline(s).

Copy link
Contributor

@tianleiwu tianleiwu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes look correct and robust.

Thread Safety: The use of static std::mutex effectively protects the initialization logic.
Resource Management: Removing the manual delete corresponds correctly to the static ownership model.

Logic Correctness: The separation of native_custom_ops_initialized check ensures that native ops are returned correctly even if custom_op_domain is empty, while preventing duplicate initialization.

The PR improves stability and prevents potential memory corruption.

@vishalpandya1990
Copy link
Contributor Author

vishalpandya1990 commented Feb 10, 2026

Error in React Native CI Pipeline / React Native CI Android (pull_request) doesn't look related.

Raw Logs - link

2026-02-10T08:53:29.1912762Z error: Command failed: "/usr/local/lib/android/sdk/emulator/emulator" -version 2026-02-10T08:53:29.1913515Z /usr/local/lib/android/sdk/emulator/qemu/linux-x86_64/qemu-system-x86_64: error while loading shared libraries: libpulse.so.0: cannot open shared object file: No such file or directory


2026-02-10T08:53:41.3067325Z @Thread mqt_native_modules(47): 2026-02-10T08:53:41.3068251Z java.lang.UnsatisfiedLinkError: No implementation found for void ai.onnxruntime.reactnative.OnnxruntimeModule.nativeCleanup() (tried Java_ai_onnxruntime_reactnative_OnnxruntimeModule_nativeCleanup and Java_ai_onnxruntime_reactnative_OnnxruntimeModule_nativeCleanup__) 2026-02-10T08:53:41.3069236Z at ai.onnxruntime.reactnative.OnnxruntimeModule.nativeCleanup(Native Method) 2026-02-10T08:53:41.3069746Z at ai.onnxruntime.reactnative.OnnxruntimeModule.invalidate(OnnxruntimeModule.java:39) 2026-02-10T08:53:41.3070218Z at com.facebook.react.bridge.ModuleHolder.destroy(ModuleHolder.java:109) ... 2026-02-10T08:53:41.3076089Z at com.facebook.react.bridge.queue.MessageQueueThreadImpl$4.run(MessageQueueThreadImpl.java:234) 2026-02-10T08:53:41.3076643Z at java.lang.Thread.run(Thread.java:920)

// Callers receive raw pointers via .get().
// 1. Manually deleting them would cause a double-free when the static unique_ptrs are destroyed at program exit.
// 2. Resetting the static unique_ptrs is also unsafe because other EP instances or InferenceSession objects
// may still hold raw pointers to these same objects (handed out via domain_list).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would this indicate a different problem of someone calling to destroy objects that are in-use? Should we fix that bug?

Another question, static objects would be destroyed just prior to this DLL being unloaded. We want to make sure that the entities being destroyed do not refer to another DLL that could potentially be unloaded first.

It is for the reason people usually introduce a special API to have control of the process and to destroy things at a safe time and not to delegate it to a OS dependent specifics when shared objects are unloaded and the order of static destruction.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would this indicate a different problem of someone calling to destroy objects that are in-use?

Yes, this is a potential use-after-free scenario. I think it should get mitigated with current change.

We want to make sure that the entities being destroyed do not refer to another DLL
usually introduce a special API to have control of the process and to destroy things at a safe time

I see your point. Usually, we could have ref-counted concerned objects for handling this (or, make them part of EP instance, or session to avoid shared usage). However, I believe no cross-DLL memory is actually accessed during destruction today.

I think it will be better to decouple current change about avoid-repetition handling with any potential design changes on this part. Please let me know if this sounds okay to you.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here my take on this. There is not a firmly defined policy here on handling these objects. I think we need to make a choice here:

  • Remove the Release functions and give away shared_ptrs OR
  • Use the Release functions so client code can destroy the objects when it KNOWS that raw pointers are no longer in use.

Until that happens, this is going to be never-ending chasing of the tail with different OS dependent issues.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants